Abstract

The main goal of our project is to identify the emotions a speaker
expresses when speaking. For example, utterances spoken in states of
fear, surprise, excitement, anger, or joy are loud and fast and have a
large and wide pitch range, whereas utterances spoken in states of
depression or fatigue are slow and deep. We use deep learning techniques
to build models that can identify these emotions. The main reason for
choosing this project is that speech sentiment analysis has become one of
the largest commercialization strategies in which client moods and
dispositions play a large role. Therefore, there is an increased demand
for products or companies to recognize an individual’s emotions and
recommend appropriate products or assist them accordingly. It can also be
used to monitor status. More recently, speech recognition and analysis
have also been applied to medicine and forensics.

Keywords: MFCCs; mel-spectrograms


1. Introduction 


Speech emotion recognition (SER) systems have developed from a
specialized field into a crucial component of human-computer interaction
(HCI). Instead of using conventional devices as input, these HCI systems
aim to understand spoken content directly and to make communication with
machines more natural through explicit speech interaction. Applications
include spoken-language dialogue systems for call center consultations,
music recommendation systems based on the user’s mood, and emotion
analysis from speech in medical and forensic settings (Senthilkumar et
al.). However, HCI systems still face many difficulties, including noisy
settings and differing speaker accents, which cause ambiguity that has
yet to be properly resolved.




2. Literature Survey 


Edward Jones et al. (Jones) presented the paper “Speech Emotion
Recognition Using Deep Learning Techniques: A Review”. According to this
review, deep learning methods offer easy model training as well as the
efficiency of shared weights. Their limitations include a large
layer-wise internal architecture, lower efficiency for temporally varying
input data, and overlearning during memorization of layer-wise
information.

Ron Hoory et al. (Hoory) presented the paper “Speech Emotion Recognition
Using Self-Supervised Features”. The paper introduces a modular
end-to-end (E2E) SER system based on an Upstream + Downstream
architecture paradigm. The authors show that well-designed combinations
of carefully fine-tuned and averaged Upstream models and averaged
Downstream models can significantly improve the performance of E2E SER
models.


Mira Kartiwi et al. (Kartiwi) presented the paper “A Comprehensive Review
of Speech Emotion Recognition Systems”. This paper points out that deep
learning techniques are considered better suited to SER systems than
traditional techniques because of advantages such as scalability,
all-purpose parameter fitting, and infinitely flexible function.


Srinivasa Parthasarathy et al. (Parthasarathy) presented the paper
“Semi-Supervised Speech Emotion Recognition With Ladder Networks”. The
results showed substantial improvements when using the suggested models,
supporting the generalizability of ladder networks. The improvements were
particularly high when using unlabeled data from the target domain,
exploiting all the benefits of the proposed architecture.


3. Design 
3.1. System Architecture 


FIGURE 1 depicts the overall architecture of the project, where input is
taken in the form of speech or audio. Features are then extracted from
the input. The main audio features extracted include Mel-frequency
cepstral coefficients (MFCC), pitch, Mel-spectrograms, chroma, and
zero-crossing rate. The dataset is then split into a training set and a
testing set. The next step is training deep learning and artificial
intelligence models using each feature extracted from the training
dataset. The testing dataset is used to evaluate the developed models. We
then save the model that gives the best accuracy, connect it to the
interface, and output the predicted emotion.


3.2. Data Flow Diagram 


FIGURE 2 illustrates the data flow diagram of the project. Once the SER
model is trained and tested, it is exported or embedded into an app. When
the application starts, it prompts the user to give input using the
Google Speech API. The input data is sent to the server via an HTTP POST
request. The server receives the input, performs feature extraction, and
runs the extracted features through the already trained SER model. It
then predicts the emotion and returns a JSON response.
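
The last step of this flow, turning the model output into a JSON
response, could look roughly like the sketch below. It is only an
illustration: the label order, the field names "emotion" and
"confidence", and the helper build_response are assumptions made here and
are not taken from the project.

```python
import json

# Assumed label order; the project does not state how classes are encoded.
EMOTIONS = ["angry", "calm", "disgust", "fear",
            "happy", "neutral", "sad", "surprise"]


def build_response(probabilities, labels=EMOTIONS):
    """Turn per-class scores from a trained model into a JSON response body."""
    if len(probabilities) != len(labels):
        raise ValueError("expected one score per emotion label")
    # Pick the index of the highest score.
    best = max(range(len(labels)), key=lambda i: probabilities[i])
    return json.dumps({
        "emotion": labels[best],
        "confidence": round(float(probabilities[best]), 3),
    })


# Example with made-up scores standing in for a model's softmax output.
example = build_response([0.05, 0.02, 0.03, 0.10, 0.60, 0.10, 0.05, 0.05])
print(example)
```

On the server, a function like this would be called after feature
extraction and prediction, and its return value sent back as the body of
the HTTP response.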
3.3. Backend Architecture


FIGURE 3 depicts the backend architecture of our project. Standard
datasets are taken as input and divided into training samples and testing
samples. Training samples undergo pre-processing, such as converting
audio waves to mel-spectrograms, and data augmentation, which increases
the diversity of the dataset using standard techniques such as changing
pitch and injecting noise. The next step is feature extraction, which
extracts features such as MFCC, chroma, and Mel-frequency spectrograms.
These extracted features are then sent to classifiers for predicting
emotion. To evaluate the model, we use testing samples that undergo the
same feature extraction and are sent to the classifiers. We can also test
by providing live input via the Google Speech Recognition API; this input
undergoes feature extraction and is sent to the model for predicting
emotion.

FIGURE 3. Backend Architecture of Speech Emotion Recognition System


4. Methodology 
4.1. Existing system 


The majority of the SER systems presently on the market employ
traditional machine learning algorithms for emotion recognition,
including Support Vector Machines (SVM), K-Nearest Neighbors (KNN), and
Gaussian Mixture Models (GMM). These models have low accuracy and high
computational complexity. There are also deep learning models, such as
the Convolutional Neural Network (CNN), Quaternion Convolutional Neural
Network (QCNN), and Long Short-Term Memory (LSTM), but these give an
average accuracy of only around 80 percent because of their computational
complexity and many other reasons.


4.2. Proposed System 


We propose an enhanced speech emotion recognition method that uses a
hybrid of two deep neural networks, CNN and LSTM, to detect the emotions
expressed by the speaker. The method uses Mel-frequency cepstral
coefficients (MFCC), chromagram, and Mel-scale spectrogram in conjunction
with spectral contrast to extract details about an audio file. These
features are used to train our hybrid model, which gives better accuracy
than other existing models. The model classifies speech audio into 8
different emotions: neutral, calm, surprise, happy, anger, fearful,
disgust, and sad.


4.3. Proposed Methodology 


As the system architecture shows, voice recordings are used for training.
Each recording is passed through pre-processing and feature extraction,
which yields the training arrays. These arrays are then used to train
classifiers that decide on the emotion. A large dataset of voices
expressing different emotions is therefore needed for training. We
searched the web and found several datasets, some of which are listed
below:

1. Crowd-sourced Emotional Multimodal Actors Dataset (CREMA-D)

2. Ryerson Audio-Visual Database of Emotional Speech and Song (RAVDESS)

3. Surrey Audio-Visual Expressed Emotion (SAVEE)

4. Toronto Emotional Speech Set (TESS)

To begin with, we created data frames. The next step was data
visualization and exploration, in which we produced a wave plot and a
spectrogram of the audio input.

Data augmentation is the process of creating new synthetic data samples
by adding small perturbations to the initial training set. To generate
synthetic data for audio, we can add noise, shift the time, and change
the pitch and the pace. The objective is to make the model invariant to
those perturbations and enhance its ability to generalize. For this to
work, the perturbed sample must keep the same label as the original
training sample. For images, data augmentation can be performed by
shifting, zooming, or rotating the image.

The next step is feature extraction, in which we extracted 5 features to
train our models: zero-crossing rate, chroma STFT, MFCC, RMS (root mean
square) value, and Mel spectrogram. In the data preparation step, we then
split the datasets into training and testing sets. Finally, we trained
machine learning models such as Decision Tree, KNN, MLP Classifier, LSTM,
and CNN, and generated a confusion matrix and a classification report for
each model.


5. Experimentation 

5.1. Data Augmentation 

FIGURE 4 shows wave plots of audio files after 
applying different data augmentation techniques 
such as injecting noise, stretching audio, and chang- 
ing the pitch of audio in order to increase the diver- 
sity of the dataset and lessen the model’s overfitting. 
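
The augmentations named above can be sketched with NumPy alone. The
function names, the noise factor, the 16 kHz sampling rate, and the
synthetic test tone are assumptions chosen for illustration. Note also
that change_speed is a naive resampling that alters pitch and duration
together; it stands in for, and is not the same as, the separate
time-stretch and pitch-shift operations that audio libraries provide.

```python
import numpy as np


def add_noise(signal, noise_factor=0.005, rng=None):
    """Inject Gaussian noise scaled relative to the signal's peak amplitude."""
    rng = np.random.default_rng() if rng is None else rng
    scale = noise_factor * np.max(np.abs(signal))
    return signal + scale * rng.standard_normal(signal.shape)


def shift(signal, n_samples):
    """Shift the waveform in time by rolling it n_samples to the right."""
    return np.roll(signal, n_samples)


def change_speed(signal, rate):
    """Resample by linear interpolation; rate > 1 shortens the clip.

    This changes pitch and duration together (a simple stand-in for
    proper time stretching or pitch shifting).
    """
    n_out = int(round(len(signal) / rate))
    old_idx = np.arange(len(signal))
    new_idx = np.linspace(0, len(signal) - 1, n_out)
    return np.interp(new_idx, old_idx, signal)


# Synthetic one-second 220 Hz tone at an assumed 16 kHz sampling rate.
sr = 16000
t = np.arange(sr) / sr
clean = 0.5 * np.sin(2 * np.pi * 220 * t)

# Each augmented copy keeps the emotion label of the original clip.
augmented = [
    add_noise(clean, rng=np.random.default_rng(0)),
    shift(clean, 1600),
    change_speed(clean, 1.25),
]
```

Because every perturbed copy inherits the label of its source clip, the
perturbations should stay small enough that a listener would still hear
the same emotion.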


5.2. Feature Extraction 


FIGURE 5 shows the different features extracted from the audio.

1. Mel-spectrograms represent sound or audio on the mel scale. The mel
scale is used because humans perceive sound differently from machines: a
machine has the same resolution across all frequencies, whereas human
hearing has a higher resolution at lower frequencies (Muppidi and
Radfar). We convert audio frequency to mel frequency because it has been
found that simulating this characteristic of human hearing during feature
extraction improves the model’s accuracy.

2. Chroma: A spectrogram is projected onto 12 bins that represent the 12
distinct semitones of the standard chromatic scale (Aftab et al.). It
displays the energy of each pitch class that is present in the signal.

3. Zero-crossing rate: The rate at which the signal changes from positive
to negative and vice versa, that is, how often the signal crosses the
horizontal axis.

4. MFCC: It represents the short-time power spectrum envelope, which
reflects the vocal tract’s shape.

5. RMS value: One of the most crucial parameters, it shows the signal’s
strength or power.
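
Three of the quantities in the list above are simple enough to sketch
directly in NumPy: the Hz-to-mel conversion, the zero-crossing rate, and
the RMS value. The frame length of 2048 samples, the hop of 512 samples,
and the 16 kHz test tone are assumptions for illustration, and the mel
formula shown is one common convention (other variants exist). MFCC and
chroma additionally need a Fourier transform and filter banks, so they
are left out of this sketch.

```python
import numpy as np


def hz_to_mel(freq_hz):
    """Convert Hz to mels with the widely used 2595*log10(1 + f/700) formula."""
    return 2595.0 * np.log10(1.0 + np.asarray(freq_hz, dtype=float) / 700.0)


def frame_signal(signal, frame_length=2048, hop_length=512):
    """Split a 1-D signal into overlapping frames (no padding)."""
    n_frames = 1 + (len(signal) - frame_length) // hop_length
    idx = (np.arange(frame_length)[None, :]
           + hop_length * np.arange(n_frames)[:, None])
    return signal[idx]


def zero_crossing_rate(frames):
    """Fraction of adjacent sample pairs in each frame whose signs differ."""
    signs = np.signbit(frames)
    return np.mean(signs[:, 1:] != signs[:, :-1], axis=1)


def rms(frames):
    """Root mean square of each frame, a measure of signal strength."""
    return np.sqrt(np.mean(frames ** 2, axis=1))


# Synthetic 100 Hz tone with amplitude 0.5 at an assumed 16 kHz rate.
sr = 16000
t = np.arange(sr) / sr
tone = 0.5 * np.sin(2 * np.pi * 100 * t)

frames = frame_signal(tone)
zcr_per_frame = zero_crossing_rate(frames)
rms_per_frame = rms(frames)
```

For this tone the zero-crossing rate is close to 2 * 100 / 16000 per
sample pair, and the RMS is close to 0.5 divided by the square root of 2,
which gives a quick sanity check on both functions.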


5.3. Data Pre-Processing 


In data pre-processing, we load the features into the X variable and the
emotions into the Y variable. Since detecting the emotions of the speaker
is a multiclass classification problem, we use one-hot encoding, by which
categorical data are converted into binary features (Prasomphan). We then
split the dataset into a training set and a testing set: in our project,
75 percent is training data and the remaining 25 percent is testing data.
Finally, we standardize the data using StandardScaler to make sure all
variables contribute equally.
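
The three pre-processing steps can be sketched as follows. The sketch
uses plain NumPy stand-ins rather than the scikit-learn utilities named
in the text, so that each step is visible; the helper names, the random
placeholder data, and the seed are assumptions for illustration. The
standardization uses the mean and standard deviation of the training set
only, which is the usual way to keep test data from influencing the
scaling.

```python
import numpy as np


def one_hot(labels):
    """Map categorical labels to one-hot rows; returns (classes, matrix)."""
    classes, indices = np.unique(labels, return_inverse=True)
    return classes, np.eye(len(classes))[indices]


def split_train_test(X, Y, test_fraction=0.25, seed=0):
    """Shuffle, then hold out test_fraction of the samples for testing."""
    rng = np.random.default_rng(seed)
    order = rng.permutation(len(X))
    n_test = int(round(test_fraction * len(X)))
    test_idx, train_idx = order[:n_test], order[n_test:]
    return X[train_idx], X[test_idx], Y[train_idx], Y[test_idx]


def standardize(x_train, x_test):
    """Zero-mean, unit-variance scaling fitted on the training set only."""
    mean = x_train.mean(axis=0)
    std = x_train.std(axis=0)
    std[std == 0] = 1.0  # avoid division by zero for constant features
    return (x_train - mean) / std, (x_test - mean) / std


# Placeholder data: 200 samples with 4 features and 3 emotion labels.
rng = np.random.default_rng(1)
X = rng.normal(loc=5.0, scale=3.0, size=(200, 4))
emotions = rng.choice(["angry", "happy", "sad"], size=200)

classes, Y = one_hot(emotions)
x_train, x_test, y_train, y_test = split_train_test(X, Y)
x_train, x_test = standardize(x_train, x_test)
```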


6. Results And Analysis
6.1. Decision Tree 


# Decision tree
from sklearn.tree import DecisionTreeClassifier

clf3 = DecisionTreeClassifier()
clf3 = clf3.fit(x_train, y_train)
y_pred = clf3.predict(x_test)

print("Training set score: {:.3f}".format(clf3.score(x_train, y_train)))
print("Test set score: {:.3f}".format(clf3.score(x_test, y_test)))

Training set score: 1.000
Test set score: 0.533


FIGURE 6. Training 


FIGURE 6 depicts the training and test scores of the decision tree model.
The training score is 100 percent, whereas the test score is only around
53 percent, which indicates that the model is overfitting (Singh and
Goel). The model memorizes exact input and output pairs in the training
data instead of learning patterns, so it underperforms when evaluated on
test data.


6.2. KNN 


FIGURE 7 illustrates the training and test scores of the KNN model. The
training score is around 49 percent and the test score is around 37
percent, which is not good enough for deployment (Vamshi and Krishna).
The low test score is probably because the test set may contain feature
patterns that are not present in the training set. In addition, KNN is
poorly suited to large datasets because calculating the distance between
every new point and all existing points is very costly, which degrades
model performance.
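
The cost mentioned above is easy to see in a brute-force implementation.
The sketch below is a minimal NumPy version written for illustration, not
the scikit-learn classifier used in the experiments; the function name
and the toy data are assumptions. Every prediction compares the query
with every training point, so the work grows with the number of queries
times the number of training samples times the number of features.

```python
import numpy as np


def knn_predict(x_train, y_train, x_query, k=4):
    """Brute-force k-NN; y_train holds non-negative integer class ids."""
    # Pairwise squared Euclidean distances, shape (n_query, n_train).
    diffs = x_query[:, None, :] - x_train[None, :, :]
    dists = np.sum(diffs ** 2, axis=2)
    # Indices of the k closest training points for each query.
    nearest = np.argsort(dists, axis=1)[:, :k]
    votes = y_train[nearest]
    # Majority vote; ties go to the smallest class id.
    return np.array([np.bincount(row).argmax() for row in votes])


# Toy data: two well-separated clusters standing in for two emotions.
x_train = np.array([[0.0, 0.0], [0.1, 0.2], [0.2, 0.1],
                    [5.0, 5.0], [5.1, 4.9], [4.8, 5.2]])
y_train = np.array([0, 0, 0, 1, 1, 1])
x_query = np.array([[0.05, 0.05], [5.0, 5.1]])

pred = knn_predict(x_train, y_train, x_query, k=3)
```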


6.3. MLP Classifier 


FIGURE 8 depicts the training and test scores of the MLP Classifier. The
training score is around 80 percent, which is good, but the test score is
around 50 percent, which makes the model unsuitable for deployment.

# KNN
from sklearn.neighbors import KNeighborsClassifier

clf1 = KNeighborsClassifier(n_neighbors=4)
clf1.fit(x_train, y_train)
y_pred = clf1.predict(x_test)

print("Training set score: {:.3f}".format(clf1.score(x_train, y_train)))
print("Test set score: {:.3f}".format(clf1.score(x_test, y_test)))

Training set score: 0.492
Test set score: 0.372


FIGURE 7. Training 


from sklearn.neural_network import MLPClassifier
… alpha=0.01, batch_size=270, epsilon=1e-08, hidden_layer_sizes=(400,), learning_rate='adaptive', max_iter=…


FIGURE 8. Training 


6.4. LSTM 


FIGURE 9 illustrates the confusion matrix and the actual-predicted output
of the LSTM model. The model’s accuracy is approximately 67%. The input
data used in our project is sequential, and the LSTM model is
predominantly used for this kind of data since it can remember long-term
dependencies between time steps (Aggarwal et al.). Training accuracy was
good, but the model underperformed on testing data. This is because LSTMs
overfit easily and implementing dropout for them is quite difficult.

FIGURE 9. Confusion


6.5. CNN 


FIGURE 10 illustrates the confusion matrix and the actual-predicted
output of the CNN model. The classification report in FIGURE 11 shows
that the accuracy of the model is around 62 percent. The model is most
accurate in predicting the surprise and angry emotions, which makes sense
because audio files of these emotions differ from other audio files in
many ways, such as pitch and speed.

FIGURE 10. Confusion


              precision    recall  f1-score   support
angry              0.73      0.76      0.74      1438
calm               0.69      0.77      0.73       137
disgust            0.56      0.54      0.55      1468
fear               0.60      0.55      0.57      1424
happy              0.57      0.58      0.57      1462
neutral            0.62      0.56      0.58      1310
sad                0.68      0.67      0.63      1400
surprise           0.83      0.86      0.84       483
accuracy                               0.62      9122
macro avg          0.65      0.66      0.65      9122
weighted avg       0.62      0.62      0.62      9122
FIGURE 11. Classification 
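
The columns of such a classification report can all be derived from the
confusion matrix. The sketch below shows one way to do so in NumPy; the
function names and the tiny label arrays are assumptions for
illustration, and rows are taken to be true classes and columns predicted
classes. Precision divides the diagonal by the column sums, recall
divides it by the row sums, and support is the row sum.

```python
import numpy as np


def confusion_matrix(y_true, y_pred, n_classes):
    """Count samples: rows are true classes, columns are predicted classes."""
    cm = np.zeros((n_classes, n_classes), dtype=int)
    np.add.at(cm, (y_true, y_pred), 1)
    return cm


def per_class_report(cm):
    """Return precision, recall, F1, support per class, and accuracy."""
    tp = np.diag(cm).astype(float)
    predicted = cm.sum(axis=0)   # how often each class was predicted
    support = cm.sum(axis=1)     # how often each class actually occurred
    precision = np.divide(tp, predicted, out=np.zeros_like(tp),
                          where=predicted > 0)
    recall = np.divide(tp, support, out=np.zeros_like(tp),
                       where=support > 0)
    denom = precision + recall
    f1 = np.divide(2 * precision * recall, denom, out=np.zeros_like(tp),
                   where=denom > 0)
    accuracy = tp.sum() / cm.sum()
    return precision, recall, f1, support, accuracy


# Made-up labels for three classes, for illustration only.
y_true = np.array([0, 0, 0, 1, 1, 2, 2, 2])
y_pred = np.array([0, 0, 1, 1, 1, 2, 0, 2])

cm = confusion_matrix(y_true, y_pred, n_classes=3)
precision, recall, f1, support, accuracy = per_class_report(cm)

macro_f1 = f1.mean()                           # unweighted mean over classes
weighted_f1 = np.average(f1, weights=support)  # weighted by class support
```

The macro average treats every emotion equally, while the weighted
average follows the class sizes, which is why the two can differ when
some emotions have far fewer samples than others.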
6.6. CNN-LSTM

FIGURE 12. Confusion Matrix And Output Table Of CNN-LSTM Model


FIGURE 12 illustrates the confusion matrix and the actual-predicted
output of the CNN-LSTM model. The classification report in FIGURE 13
shows that the accuracy of the model is around 94 percent. The overall
accuracy is good because, in the CNN-LSTM hybrid model, the CNN extracts
features and the LSTM handles sequential learning.

print(classification_report(y_test_, y_pred))


              precision    recall  f1-score   support
angry              0.95      0.96      0.95      1574
calm               0.93      0.96      0.94       119
disgust            0.92      0.92      0.92      1545
fear               0.94      0.93      0.93      1531
happy              0.94      0.93      0.93      1531
neutral            0.93      0.92      0.92      1335
sad                0.92      0.95      0.93      1527
surprise           0.98      0.99      0.98       683
accuracy                               0.94      9845
macro avg          0.94      0.94      0.94      9845
weighted avg       0.94      0.94      0.94      9845

FIGURE 13. Classification Report Of CNN-LSTM Model

MODELS                                              AVERAGE ACCURACY
GMM (Gaussian Mixture Model)                        72.61%
SVM (Support Vector Machine)                        78.16%
MLP (Multilayer Perceptron)                         71.87%
Decision Tree
KNN (K-Nearest Neighbors)
MDT (Meta Decision Tree)                            80%
Q-CNN (Quaternion Convolutional Neural Network)     77.97%
LSTM (Long Short-Term Memory)                       67%
CNN (Convolutional Neural Network)                  62%
CNN-LSTM                                            94%


FIGURE 14. Comparison Of Results 


7. Comparison Of Results 


FIGURE 14 presents a comparative study of the various classification
models. The addition of behavioral features has improved the accuracy of
the proposed model. After a thorough comparative study of different
classification models, the CNN-LSTM model gave the best accuracy.


8. Conclusion 


In conclusion, because of its potential applications in a variety of
domains, including human-computer interaction, healthcare, and
psychology, the development of speech emotion recognition (SER) systems
has grown in importance as a research topic. The goal of SER is to
automatically ascertain a speaker’s emotional state from their speech
signal.

In this article, we covered the speech pre-processing, feature
extraction, and classification processes that make up a typical SER
system. We covered a number of methods for each of these elements,
including feature extraction methods like Mel-frequency cepstral
coefficients (MFCCs), signal processing methods like filtering, and
classification algorithms like support vector machines (SVMs) and deep
learning models.

Although SER has made significant progress in recent years, there is
still much room for improvement. To increase the reliability and accuracy
of SER systems and to make them viable for use in real-world
circumstances, more research and development is required.